NSF PAR Search | NSF Public Access Repository


FETILDA: An Evaluation Framework for Effective Representations of Long Financial Documents

https://doi.org/10.1145/3657299

Xia, Bolun Namir; Rawte, Vipula; Gupta, Aparna; Zaki, Mohammed (April 2024, ACM Transactions on Knowledge Discovery from Data)

In the financial sphere, there is a wealth of accumulated unstructured financial data, including the textual disclosure documents that companies regularly submit to regulatory agencies such as the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company’s performance that is not present in quantitative predictors. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). In recent years, there has been great progress in natural language processing via pre-trained language models (LMs) learned from large corpora of textual data. This prompts the important question of whether they can be used effectively to produce representations for long documents, as well as how we can evaluate the effectiveness of representations produced by various LMs. Our work focuses on answering this critical question, namely the evaluation of the efficacy of various LMs in extracting useful soft information from long textual documents for prediction tasks. In this paper, we propose and implement a deep learning evaluation framework that utilizes a sequential chunking approach combined with an attention mechanism. We perform an extensive set of experiments on a collection of 10-K reports submitted annually by US banks, and another dataset of reports submitted by US companies, in order to investigate thoroughly the performance of different types of language models. Overall, our framework using LMs outperforms strong baseline methods for textual modeling as well as for numerical regression. Our work provides better insights into how utilizing pre-trained domain-specific and fine-tuned long-input LMs for representing long documents can improve the quality of representation of textual data, and therefore, help in improving predictive analyses.
Full Text Available
ProKnow: Process knowledge for safety constrained and explainable question generation for mental health diagnostic assistance

https://doi.org/10.3389/fdata.2022.1056728

Roy, Kaushik; Gaur, Manas; Soltani, Misagh; Rawte, Vipula; Kalyan, Ashwin; Sheth, Amit (January 2023, Frontiers in Big Data)

Virtual Mental Health Assistants (VMHAs) are utilized in health care to provide patient services such as counseling and suggestive care. They are not used for patient diagnostic assistance because they cannot adhere to safety constraints and specialized clinical process knowledge (ProKnow) used to obtain clinical diagnoses. In this work, we define ProKnow as an ordered set of information that maps to evidence-based guidelines or categories of conceptual understanding to experts in a domain. We also introduce a new dataset of diagnostic conversations guided by safety constraints and ProKnow that healthcare professionals use (ProKnow-data). We develop a method for natural language question generation (NLG) that collects diagnostic information from the patient interactively (ProKnow-algo). We demonstrate the limitations of using state-of-the-art large-scale language models (LMs) on this dataset. ProKnow-algo incorporates the process knowledge through explicitly modeling safety, knowledge capture, and explainability. As computational metrics for evaluation do not directly translate to clinical settings, we involve expert clinicians in designing evaluation metrics that test four properties: safety, logical coherence, and knowledge capture for explainability while minimizing the standard cross entropy loss to preserve distribution semantics-based similarity to the ground truth. LMs with ProKnow-algo generated 89% safer questions in the depression and anxiety domain (tested property: safety). Further, without ProKnow-algo, generated questions did not adhere to clinical process knowledge in ProKnow-data (tested property: knowledge capture). In comparison, ProKnow-algo-based generations yield a 96% reduction in our metrics to measure knowledge capture. The explainability of the generated question is assessed by computing similarity with concepts in depression and anxiety knowledge bases. Overall, irrespective of the type of LMs, ProKnow-algo achieved an averaged 82% improvement over simple pre-trained LMs on safety, explainability, and process-guided question generation. For reproducibility, we will make ProKnow-data and the code repository of ProKnow-algo publicly available upon acceptance.
Full Text Available
The Troubling Emergence of Hallucination in Large Language Models - An Extensive Definition, Quantification, and Prescriptive Remediations

https://doi.org/10.18653/v1/2023.emnlp-main.155

Rawte, Vipula; Chakraborty, Swagata; Pathak, Agnibh; Sarkar, Anubhav; Tonmoy, SM Towhidul Islam; Chadha, Aman; Sheth, Amit; Das, Amitava (January 2023, Association for Computational Linguistics)

Full Text Available
Fetilda: An effective framework for fin-tuned embeddings for long financial text documents

Xia, Bolun; Rawte, Vipula D; Zaki, Mohammed J; Gupta, Aparna (June 2022, arXiv preprints)

Unstructured data, especially text, continues to grow rapidly in various domains. In particular, in the financial sphere, there is a wealth of accumulated unstructured financial data, including the textual disclosure documents that companies regularly submit to regulatory agencies such as the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company's performance. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). Whereas there has been great progress in pre-trained language models (LMs) that learn from tremendously large corpora of textual data, they still struggle in terms of effective representations for long documents. Our work fills this critical need, namely how to develop better models to extract useful information from long textual documents and learn effective features that can leverage the soft financial and risk information for text regression (prediction) tasks. In this paper, we propose and implement a deep learning framework that splits long documents into chunks and utilizes pre-trained LMs to process and aggregate the chunks into vector representations, followed by self-attention to extract valuable document-level features. We evaluate our model on a collection of 10-K public disclosure reports from US banks, and another dataset of reports submitted by US companies. Overall, our framework outperforms strong baseline methods for textual modeling as well as a baseline regression model using only numerical data. Our work provides better insights into how utilizing pre-trained domain-specific and fine-tuned long-input LMs in representing long documents can improve the quality of representation of textual data, and therefore, help in improving predictive analyses.
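
The chunk-then-aggregate idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_embed` is a hypothetical stand-in for a pre-trained LM, the `query` vector stands in for a learned parameter, and the aggregation shown is simple single-query attention pooling rather than the self-attention layer the abstract mentions.

```python
import math

def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive, non-overlapping chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def toy_embed(chunk, dim=4):
    """Hypothetical stand-in for a pre-trained LM: mean of simple per-token features."""
    vec = [0.0] * dim
    for tok in chunk:
        code_sum = sum(ord(c) for c in tok)
        for j in range(dim):
            # Deterministic pseudo-feature in [0, 1), for illustration only.
            vec[j] += ((code_sum * (j + 1)) % 7) / 7.0
    return [v / len(chunk) for v in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(chunk_vecs, query):
    """Weight each chunk vector by softmax(dot(chunk, query)) and sum them."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in chunk_vecs]
    weights = softmax(scores)
    dim = len(query)
    doc_vec = [sum(w * v[j] for w, v in zip(weights, chunk_vecs)) for j in range(dim)]
    return doc_vec, weights

# Toy "document" of 12 tokens, split into chunks of 4 tokens each.
tokens = "the bank reported higher credit risk in its annual filing this year".split()
chunks = split_into_chunks(tokens, chunk_size=4)
chunk_vecs = [toy_embed(c) for c in chunks]
query = [1.0, 0.5, -0.5, 0.25]  # assumed; would be learned in a real model
doc_vec, weights = attention_pool(chunk_vecs, query)
```

Here `doc_vec` is a single fixed-size vector for the whole document, which is the kind of representation a downstream regression model for KPI prediction could consume; `weights` shows how much each chunk contributed.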
Using supervised learning techniques for entity relationships

https://doi.org/10.1145/3220547.3226044

Rawte, Vipula; Gupta, Aparna; Zaki, Mohammed J. (June 2018, DSMM'18: Proceedings of the Fourth International Workshop on Data Science for Macro-Modeling with Financial and Economic Datasets)

Given different financial data resources, it is very challenging to relate entities across the various resources since each resource has its own way of describing the entities and relationships. We work on identifying such relationships using context and available scores, using mainly supervised machine learning techniques to build classifiers and predict new relationships or validate the existing ones based on the suitable measures of similarity.
Full Text Available
Analysis of year-over-year changes in Risk Factors Disclosure in 10-K filings

https://doi.org/10.1145/3220547.3220555

Rawte, Vipula; Gupta, Aparna; Zaki, Mohammed J. (June 2018, DSMM'18: Proceedings of the Fourth International Workshop on Data Science for Macro-Modeling with Financial and Economic Datasets)

Full Text Available
An Ontology for a Polymer Nanocomposite Community Data Resource

https://doi.org/10.1145/3091478.3098866

Rawte, Vipula; McCusker, James; Zhao, He; Brinson, L. Catherine; Chen, Wei; Schadler, Linda; McGuinness, Deborah L. (January 2017, WebSci2017)

Full Text Available
